Introduction to Open Data Science - Course Project

About the project

This is the ‘Open Data Science’ course. I am familiar with R and RStudio. I have used Git before, but I am not confident enough to use it independently. I am interested in learning about data science and in how statistics can be conveyed to non-statisticians. I was searching for a course, not necessarily in statistics, that would cover data science applicable to any field along with some simple ways of learning and using GitHub. Reviews from earlier students of this course encouraged me to enroll. My GitHub repository for this course: https://github.com/Ashwini-Helsinki/IODS-project


Regression and model validation

The first step is to load the required libraries (R packages) for this exercise.

# Required libraries
library(GGally)
library(ggplot2)

Read the data

The data created in the data wrangling exercise is read in as ‘data2’ here. The data covers 166 students; each row describes one student with 7 columns of information, such as age, gender, the student’s learning approaches (deep, surface and strategic learning) and attitude towards statistics. The column ‘Points’ gives each student’s exam points.

setwd("D:/Helsinki/Courses/OpenDataScience/IODS-project/")
data2 <- read.csv("data/learning2014.csv",header = TRUE)

# Dimension of data: rows and columns respectively.
dim(data2)
## [1] 166   7
# Structure of data
str(data2)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

Graphical overview

A summary of each variable in the analysis data is shown below. For each variable, the summary gives the minimum, 1st quartile, median, mean, 3rd quartile and maximum.

summary(data2)
##     gender               Age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Let’s examine in detail how each variable in the data is distributed and how the variables relate to each other.

# create graphical overview of variables with ggpairs()
p <- ggpairs(data2, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

The graph above shows the distribution of each variable in the data and its relationship with the other variables. It is a 7 x 7 plot matrix in which each row and each column represents one variable from the data.

For every variable except ‘gender’, a density plot is shown on the diagonal. Below the diagonal, each cell shows a scatter plot of the column variable against the row variable. Above the diagonal, each cell shows the correlation between the row and column variables.

Since the gender variable has only the two values ‘F’ and ‘M’, the plot shows a histogram for it instead. Its relation with the other variables is shown by per-gender histograms (in the first column) and per-gender box plots (in the first row) of those variables.

From the graph, it can be seen that there are more than 100 female students and around 50 male students. The density plot of ‘Age’ shows that most students are younger than 30; the long right tail indicates a few students between 30 and 60 years of age. The density of ‘attitude’ looks approximately normal, spanning 1 to 5 with a mode around 3. The ‘deep’ learning approach has a long left tail. Strategic learning (‘stra’) and surface learning (‘surf’) both show slightly bimodal densities. ‘Points’ has a mode around 22 and a small number of observations near 11.

From the correlations, it can be seen that attitude towards statistics and exam points are positively correlated: the better the attitude, the higher the exam points. Surface learning is negatively correlated with all other variables.
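As a quick sketch of how such a correlation can be checked numerically, cor() computes the Pearson correlation; the built-in cars data is used here as a stand-in, since it ships with R.

```r
# cor() returns the Pearson correlation; a positive value means the two
# variables increase together, as 'attitude' and 'Points' do above.
r <- cor(cars$speed, cars$dist)
r  # about 0.81: stopping distance grows with speed
```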

The following graph shows how these variables differ between the two ‘gender’ groups. The transparency of the overlapping plots is set with ‘alpha’.

# create graphical overview of variables with ggpairs()
p1 <- ggpairs(data2, mapping = aes(col= gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p1

Linear regression

Let’s fit a linear regression model to study the effect of the three covariates ‘attitude’, ‘deep’ and ‘stra’ on exam points.

my_model1 <- lm(Points ~ attitude + deep + stra, data= data2)

summary(my_model1)
## 
## Call:
## lm(formula = Points ~ attitude + deep + stra, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## deep         -0.7492     0.7507  -0.998  0.31974    
## stra          0.9621     0.5367   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

The ‘t value’ in the summary table is the test statistic for the statistical test of the parameter (coefficient) of each covariate (explanatory variable). It tests whether the coefficient is zero. A coefficient away from zero means that the corresponding explanatory variable has an impact on the response (target) variable. If the standardized test statistic (t value) is large in magnitude (positive or negative), the corresponding p-value is small, indicating statistical significance of the covariate.
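The relationship between the estimate, its standard error, the t value and the p-value can be verified directly from the summary numbers; a sketch using the ‘attitude’ row, with the values copied from the output above:

```r
# t value = estimate / standard error; the p-value comes from the
# t distribution with the model's residual degrees of freedom (162).
est <- 3.5254
se  <- 0.5683
t_val <- est / se                       # about 6.20, as in the summary
p_val <- 2 * pt(-abs(t_val), df = 162)  # about 4.4e-09
```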

The model summary shows that ‘attitude’ and ‘stra’ have a statistically significant relationship with the target variable ‘Points’, since their p-values are small. The p-value for ‘attitude’ is less than 0.001. Its coefficient estimate is 3.5254 with standard error 0.5683, so an increase of 1 unit in attitude increases exam points by 3.5254 units on average.

The p-value for ‘stra’ is less than 0.1. Its coefficient estimate of 0.9621 indicates a 0.9621-unit increase in exam points when ‘stra’ increases by 1 unit.

The variable ‘deep’ does not show a significant relationship with ‘Points’ (p-value > 0.3), so it is removed from the model.

The intercept term has a large estimate of 11.3915, and its very small p-value suggests that some important covariates may be missing from the model.

However, no new covariate will be added at this stage. As per the instructions in the exercise, we will fit the new model by removing ‘deep’.

my_model2 <- lm(Points ~ attitude + stra , data= data2)

summary(my_model2)
## 
## Call:
## lm(formula = Points ~ attitude + stra, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

The summary of the new model is shown above. The relationships of ‘attitude’ and ‘stra’ with ‘Points’ have not changed much compared to the previous model. The model can be written as

Points = 8.9729 + (3.4658 * attitude) + (0.9137 * stra) + (Error term)

The error term is assumed to be normally distributed. Multiple R-squared is 0.2048, meaning the model explains only about 20% of the variation in the response variable ‘Points’.
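As a sketch, the fitted equation above can be applied by hand; the student profile here (attitude = 4, stra = 3) is an assumption purely for illustration.

```r
# Coefficients copied from the model summary above.
b0 <- 8.9729; b_attitude <- 3.4658; b_stra <- 0.9137

# Predicted exam points for a hypothetical student with
# attitude = 4 and stra = 3 (illustrative values).
points_hat <- b0 + b_attitude * 4 + b_stra * 3
points_hat  # about 25.6
```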

To study the model fit in more detail, let’s examine the diagnostic plots of the model.

par(mfrow = c(1,3))
plot(my_model2, which=c(1,2,5))

The Residuals vs Fitted plot shows no clear pattern that would indicate a relation between the fitted values and the residuals. The red line deviates slightly from the horizontal, but the deviation is not large. Hence the size of the errors does not appear to depend on the explanatory variables, as assumed in linear regression.

The second assumption in linear regression is that the error term is normally distributed. The second plot, the Normal Q-Q plot, shows that most observations lie close to the straight line, although some observations (numbers 35, 56 and 145) and some at the top deviate a bit from it. Points in the middle region support the normality assumption.

The Residuals vs Leverage plot shows that observations 35, 71 and 145 have extreme leverage compared to the other points. Still, no point lies beyond the Cook’s distance bounds, so no single observation has an outsized impact on the model.
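These diagnostics can also be computed numerically with base R; a minimal sketch using the built-in cars data, since the course data is not reloaded here:

```r
# Fit a simple model and extract leverage and Cook's distance
# for each observation.
m  <- lm(dist ~ speed, data = cars)
cd <- cooks.distance(m)
h  <- hatvalues(m)  # leverage of each observation

which.max(cd)  # index of the most influential observation
max(cd)        # a common rule of thumb flags values near or above 1
```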


Logistic Regression

The first step is to load the required libraries (R packages) for this exercise.

# Required libraries
library(dplyr)
library(boot)

Read the data

The data created in the data wrangling exercise is read in as ‘data3’ here. The data concerns student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, collected from school reports and questionnaires. The dataset covers performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). G1, G2 and G3 show the average grades earned by the student; the ‘.math’ and ‘.por’ suffixes distinguish grades in the Maths and Portuguese language courses. G3 is the final-year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades.

setwd("D:/Helsinki/Courses/OpenDataScience/IODS-project/")
data3 <- read.csv("data/Alcdata.csv",header = TRUE)

# Dimension of data: rows and columns respectively.
dim(data3)
## [1] 370  47
# variable names of data
colnames(data3)
##  [1] "school"        "sex"           "age"           "address"      
##  [5] "famsize"       "Pstatus"       "Medu"          "Fedu"         
##  [9] "Mjob"          "Fjob"          "reason"        "guardian"     
## [13] "traveltime"    "studytime"     "failures.math" "schoolsup"    
## [17] "famsup"        "paid.math"     "activities"    "nursery"      
## [21] "higher"        "internet"      "romantic"      "famrel"       
## [25] "freetime"      "goout"         "Dalc"          "Walc"         
## [29] "health"        "absences.math" "G1.math"       "G2.math"      
## [33] "G3.math"       "failures.por"  "paid.por"      "absences.por" 
## [37] "G1.por"        "G2.por"        "G3.por"        "failures"     
## [41] "paid"          "absences"      "G1"            "G2"           
## [45] "G3"            "alc_use"       "high_use"

This data is used in this analysis to study the relationships between high/low alcohol consumption and some of the other variables in the data.

Hypothesis

Let’s choose four variables, namely ‘sex’, ‘famrel’, ‘absences’ and ‘failures’. My hypotheses about their relationships with high alcohol consumption are as follows:
1. ‘sex’ may not have a direct relationship with high alcohol consumption.
2. ‘famrel’, i.e. good family relationship, has a negative impact on high alcohol consumption. This is an ordinal variable: the better the family relations, the less alcohol consumption.
3. ‘absences’ and ‘failures’ have a positive impact on high alcohol consumption: the more absences and failures, the more alcohol consumption.

Numerical and graphical exploration

Let’s observe the relationship of these 4 variables with high alcohol consumption.

  1. ‘sex’ is a binary variable and high alcohol consumption (‘high_use’) is also binary, so let’s use a 2x2 table and a bar graph to look at their relationship.
# 2x2table
table(data3$sex, data3$high_use)
##    
##     FALSE TRUE
##   F   154   41
##   M   105   70
# bar graph
ggplot(data3, aes(x=sex, fill=high_use))+ geom_bar(aes(y = (..count..)/sum(..count..)))+ 
          scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies")

Both the table and the graph show that the male group ‘M’ has a higher percentage of high alcohol consumption than the female group ‘F’. This suggests that my hypothesis about the relationship between ‘sex’ and high/low alcohol consumption is not true.
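The percentages behind this reading can be computed from the table itself; a sketch with the counts copied from the 2x2 table above:

```r
# Row-wise proportions of high_use within each sex.
tab <- matrix(c(154, 41, 105, 70), nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("F", "M"), high_use = c("FALSE", "TRUE")))
prop.table(tab, margin = 1)
# about 21% of female and 40% of male students are high users
```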

  2. Family relationship

Let’s look at the tabular and graphical representation.

# tabular representation
table(data3$famrel, data3$high_use)
##    
##     FALSE TRUE
##   1     6    2
##   2     9    9
##   3    39   25
##   4   128   52
##   5    77   23
# bar graph
ggplot(data3, aes(x=famrel, fill=high_use))+ geom_bar(aes(y = (..count..)/sum(..count..)), position=position_dodge())+ 
          scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies")

Both the table and the graph show that for ‘famrel’ categories 1 to 3, the relative frequency of high alcohol consumption is substantial, whereas categories 4 and 5, which indicate better family relationships, have a lower percentage of students with high alcohol consumption. Here my hypothesis is somewhat true.

  3. Absences
# tabular representation
table(data3$absences , data3$high_use)
##     
##      FALSE TRUE
##   0     50   13
##   1     37   13
##   2     40   16
##   3     31    7
##   4     24   11
##   5     16    6
##   6     16    5
##   7      9    3
##   8     14    6
##   9      5    6
##   10     5    2
##   11     2    3
##   12     3    4
##   13     1    1
##   14     1    6
##   16     0    1
##   17     0    1
##   18     1    1
##   19     0    1
##   20     2    0
##   21     1    1
##   26     0    1
##   27     0    1
##   29     0    1
##   44     0    1
##   45     1    0
# box plot
ggplot(data3, aes(x=high_use , y=absences))+ geom_boxplot()

The box plot shows that students with high alcohol consumption have more absences. The ranges of the two boxes are similar, but the box and whisker for the ‘TRUE’ category are much larger, with the 3rd quartile well above the median. This indicates that absences and high alcohol consumption are positively correlated. My hypothesis is true to an extent.

  4. Failures
# tabular representation
table(data3$failures , data3$high_use)
##    
##     FALSE TRUE
##   0   238   87
##   1    12   12
##   2     8    9
##   3     1    3
# bar graph
ggplot(data3, aes(x=failures , fill=high_use))+ geom_bar(aes(y = (..count..)/sum(..count..)),position=position_dodge())+ 
          scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies")

Here high alcohol consumption increases with the number of failures. My hypothesis is true.

Logistic regression model

Let’s fit a logistic regression model to study the effect of the four covariates ‘sex’, ‘famrel’, ‘absences’ and ‘failures’ on high/low alcohol consumption. Although ‘famrel’ and ‘failures’ could be treated as categorical variables, the data are not evenly distributed over their categories, so choosing a base category is difficult; let’s treat them as continuous variables. ‘sex’ is binary, so one category is treated as the base and the other is interpreted in comparison to it.
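A sketch of how R encodes the base category for a binary factor, using a toy vector rather than the course data:

```r
# For a two-level factor, R creates one dummy column; the first level
# (here 'F') is the base and is absorbed into the intercept.
sex <- factor(c("F", "M", "F", "M"))
contrasts(sex)
#   M
# F 0
# M 1
```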

# Fit logistic regression
my_model1 <- glm(high_use ~ sex +  famrel + absences + failures  , data= data3, family =binomial(link = "logit"))


# summary of the model
summary(my_model1)
## 
## Call:
## glm(formula = high_use ~ sex + famrel + absences + failures, 
##     family = binomial(link = "logit"), data = data3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0786  -0.8216  -0.5746   0.9760   2.1820  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.79406    0.54281  -1.463  0.14350    
## sexM         1.04800    0.25091   4.177 2.96e-05 ***
## famrel      -0.29791    0.13044  -2.284  0.02238 *  
## absences     0.08941    0.02274   3.932 8.43e-05 ***
## failures     0.57328    0.20531   2.792  0.00523 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 401.77  on 365  degrees of freedom
## AIC: 411.77
## 
## Number of Fisher Scoring iterations: 4
# Estimate and confidence interval
cbind(coef(my_model1), confint(my_model1))
##                               2.5 %      97.5 %
## (Intercept) -0.79406437 -1.87558320  0.26100198
## sexM         1.04800182  0.56283743  1.54857812
## famrel      -0.29791173 -0.55591186 -0.04251048
## absences     0.08940969  0.04695413  0.13651044
## failures     0.57327802  0.17701334  0.98884313

The ‘sexM’ estimate gives the coefficient of category ‘M’, with the base category ‘F’ absorbed into the intercept term. All four covariates have p-values smaller than 0.05, so all show a statistically significant relation with high alcohol consumption. Also, all the coefficients are away from zero, and the 95% confidence intervals exclude zero for all four covariates. To view these effects in terms of odds ratios, let’s exponentiate the coefficients and confidence intervals.
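The conversion from a log-odds coefficient to an odds ratio is just exponentiation; a sketch using the ‘sexM’ estimate copied from the output above:

```r
# A coefficient of about 1.048 on the log-odds scale corresponds to an
# odds ratio of exp(1.048): the odds of high use for males are roughly
# 2.85 times those for females, with the other covariates held fixed.
b_sexM <- 1.04800182
exp(b_sexM)  # about 2.85
```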

# Odds ratios by exponentiating coefficients of the model
print("Odds ratios - estimate and confidence interval")
## [1] "Odds ratios - estimate and confidence interval"
cbind(exp(coef(my_model1)), exp(confint(my_model1)))
##                           2.5 %    97.5 %
## (Intercept) 0.4520039 0.1532656 1.2982302
## sexM        2.8519467 1.7556470 4.7047758
## famrel      0.7423669 0.5735490 0.9583804
## absences    1.0935286 1.0480739 1.1462668
## failures    1.7740730 1.1936470 2.6881229

Although ‘absences’ has an odds ratio of 1.0935, which is not far from 1, its confidence interval excludes 1. Hence the confidence intervals of all four odds ratios exclude 1, indicating a statistical relation with the target variable ‘high_use’. Comparing this with the initial hypotheses: ‘sexM’ does have an effect on ‘high_use’, so that hypothesis is not true; ‘famrel’ has a negative effect, since its odds ratio and 95% interval lie below 1; ‘absences’ has a positive effect; and ‘failures’ clearly relates to ‘high_use’. So for these three covariates the initial hypotheses are true.

The AIC of this model is 411.77. An AIC value is only useful in comparison with the AIC of another model: the model with the lower AIC is considered the better fit.
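The AIC can be checked against the numbers in the summary, since for this model AIC = residual deviance + 2 × (number of estimated parameters):

```r
# 5 parameters: the intercept plus 4 covariate coefficients,
# both values copied from the model summary above.
residual_deviance <- 401.77
n_params <- 5
aic <- residual_deviance + 2 * n_params
aic  # 411.77, matching the summary
```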

Prediction

The next step is prediction using the model my_model1. The predict() function gives a probability for each row in the dataset; probabilities greater than 0.5 are classified as ‘TRUE’ (high_use = TRUE or 1) and the rest as ‘FALSE’ (high_use = FALSE or 0).

# predict() the probability of high_use
probabilities <- predict(my_model1, type = "response")

# add the predicted probabilities to the data to get a new dataset 'alc'
alc <- mutate(data3, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

The next table and graph show the prediction accuracy in terms of the observed values of high_use versus the predictions.

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   244   15
##    TRUE     77   34
print("In terms of proportions and with margins")
## [1] "In terms of proportions and with margins"
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65945946 0.04054054 0.70000000
##    TRUE  0.20810811 0.09189189 0.30000000
##    Sum   0.86756757 0.13243243 1.00000000
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))

# define the geom as points and draw the plot
g + geom_point()

To compute the prediction error as the average proportion of wrong predictions, let’s define a loss function as given in DataCamp.

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2486486

The average proportion of wrong predictions with the model my_model1 is about 0.25.
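A toy check of the loss function on made-up labels and probabilities (its definition is repeated so the snippet stands alone):

```r
# Average proportion of predictions on the wrong side of 0.5.
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Two of the four toy predictions fall on the wrong side of 0.5,
# so the average error is 0.5.
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.8, 0.2, 0.1))  # 0.5
```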

Cross validation

Let’s perform 10-fold cross validation for ‘alc’ data, model my_model1 and the loss function defined above.

# K-fold cross-validation

cv <- cv.glm(data = alc, cost = loss_func, glmfit = my_model1, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2567568

With 10-fold cross-validation, the average proportion of wrong predictions is about 0.26. (The exact value is not given here: since no random seed is set, each run of the cv function gives a slightly different value, but all are close to 0.26.) This is more than the training error of about 0.25, which is expected: the training error is computed on the observations used to fit the model, while the CV error comes from predicting observations that were not used in fitting. The training error reflects only estimation error and ignores the uncertainty of new, unseen data, so the CV error is typically greater than the training error.
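A sketch of how 10-fold cross-validation partitions the 370 rows (fold assignment only, not the full fit-and-predict loop):

```r
# Each row is randomly assigned to one of 10 folds; each fold is then
# used once as a test set while the other 9 folds train the model.
set.seed(1)
folds <- sample(rep(1:10, length.out = 370))
table(folds)  # 37 rows in each fold
```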

Looking at the CV error, one can say that the performance of this model is similar to that of the model given in DataCamp.

Next, let’s start from a large number of predictors and remove one predictor at each step based on deviance. At each step, the covariate that contributes most to the error is removed from the model; the process continues until removing a covariate yields no further improvement. This can be performed with the ‘step’ function on the full model.

# formula using 26 covariates in the data is created.
BiggerFormula <- as.formula(high_use ~sex + age+famsize+Pstatus+Medu+Fedu+Mjob+
  Fjob+guardian+traveltime+studytime+schoolsup+famsup+activities+nursery+higher+        
  internet+romantic+famrel+freetime+goout+health+failures+paid+absences+G3)

# glm is fitted to the bigger formula.
FullModel <- glm(BiggerFormula,family=binomial(link="logit"),data=data3)

# For loop to reduce 1 covariate in every step.
finalDF <- c() # to save final result
numcov = 26 # initial number of covariates

for( mm in 1:26){
  
  prevnumcov= numcov
  
  # perform stepwise backward elimination with steps = mm 
  Newmodel <- step(FullModel,direction="backward",trace=FALSE, steps =mm)
  
  numcov=length(coef(Newmodel))-1
  
  if(numcov == prevnumcov){ 
    print(paste0("No further improvement after ",prevnumcov, " variables." ))
    break
  }
  # predict() the probability of high_use
  probabilities <- predict(Newmodel, type = "response")
  
  # add the predicted probabilities to the data to get a new dataset 'alc1'
  alc1 <- mutate(alc, probability1 = probabilities)
  
  # use the probabilities to make a prediction of high_use
  alc1 <- mutate(alc1, prediction1 = probability1 > 0.5)
 
  trainerror = loss_func(class = alc1$high_use, prob = alc1$probability1)
  
  cv <- cv.glm(data = alc1, cost = loss_func, glmfit = Newmodel, K = 10)
  
  # average number of wrong predictions in the cross validation
  newcv = cv$delta[1]
  finalDF <- rbind(finalDF, c(numcov,trainerror,'Training'),c(numcov,newcv,'Prediction'))
}
## [1] "No further improvement after 9 variables."
finalDF = data.frame(finalDF)
colnames(finalDF) <- c('Number of covariates', 'Error', 'Type')

ggplot(finalDF, aes(x= `Number of covariates`, y=as.numeric(Error), col=Type)) + 
  geom_point() + scale_y_continuous(labels = scales::number_format(accuracy = 0.01))

For models with a large number of covariates, the training error is smaller but the prediction error is large. Models with a well-chosen smaller number of covariates have a training error that is not much different but a much lower prediction error.
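The same backward-elimination idea can be sketched on a built-in dataset (mtcars here, since the course data is not reloaded):

```r
# step() drops one term at a time for as long as AIC keeps improving.
full    <- lm(mpg ~ ., data = mtcars)
reduced <- step(full, direction = "backward", trace = FALSE)

# The reduced model keeps fewer coefficients than the full one.
length(coef(reduced)) < length(coef(full))  # TRUE
```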


Clustering and classification

The first step is to load the required libraries (R packages) for this exercise and to set a random seed for reproducibility of the results.

# Required libraries
library(MASS)
library(corrplot)
library(tidyverse)
library(plotly)
set.seed(12345)

Read the data

The ‘Boston’ data from the ‘MASS’ package is read here. The data concerns housing values and factors affecting them in the suburbs of Boston. It has 506 rows and 14 columns; each row describes one suburb with 14 attributes such as residential land, business area, pupil-teacher ratio, taxes, etc. Let’s explore the data in more detail.

# load the data
data("Boston")

# Structure of the dataset 

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Dimension of the dataset
dim(Boston)
## [1] 506  14

Summaries and distributions of the variables.

# Summary the dataset
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# distribution of variables and their relationships with each other
pairs(Boston)

# Better graphical presentation with ggpairs()
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

# draw the plot
p

The distributions of all variables can be seen from the plot above. ‘rm’, the average number of rooms per dwelling, has a nice bell-shaped curve suggesting a normal distribution with mean 6.285. The variables ‘dis’, ‘lstat’ and ‘medv’ have roughly normal curves with longer right tails. ‘indus’ and ‘tax’ seem bimodal.

To explore the relationships between the variables let’s use ‘corrplot’

# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)

# print the correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)


‘medv’ (median value of owner-occupied homes in $1000s) shows correlation with all the variables. Except for ‘chas’ (the binary ‘tract bounds river’ variable), all variables show noticeable correlation (positive or negative) with one another.

From the colorful corrplot, it can be seen that the highest positive correlation is between ‘tax’ (full-value property-tax rate per $10,000) and ‘rad’ (index of accessibility to radial highways).
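The ‘largest correlation’ reading can also be automated; a sketch that scans a correlation matrix for its largest off-diagonal entry, shown on the built-in mtcars data rather than Boston:

```r
cm <- cor(mtcars)
diag(cm) <- 0  # ignore the trivial self-correlations

# Row/column indices of the largest off-diagonal correlation in magnitude.
idx <- which(abs(cm) == max(abs(cm)), arr.ind = TRUE)[1, ]
rownames(cm)[idx]  # the most strongly correlated pair of variables
```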

‘dis’ (weighted mean of distances to five Boston employment centres) has strong negative correlations with ‘indus’ (proportion of non-retail business acres per town), ‘age’ (proportion of owner-occupied units built prior to 1940) and ‘nox’ (nitrogen oxides concentration, parts per 10 million), indicating that near the employment centres there is a higher proportion of non-retail business, a higher proportion of owner-occupied units built prior to 1940 and a higher nitrogen oxides concentration. As expected, ‘lstat’ (lower status of the population, percent) and ‘medv’ (median value of owner-occupied homes in $1000s) are also highly negatively correlated.

‘chas’ (the binary ‘tract bounds river’ variable) has hardly any correlation with the other variables. Although in ancient times people chose to settle near rivers and other water sources, proximity to a river apparently has little impact on housing choices today.

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

After scaling, the ranges of all variables have shrunk: each variable now has mean 0, and most of the 1st and 3rd quartiles lie between (or near) -1 and 1. The next step is to create a categorical crime variable, with each category indicating a level of crime. The original variable ‘crim’ will then be removed and the categorical variable added as a new column.
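To make the standardization concrete: `scale()` subtracts each column's mean and divides by its standard deviation. A minimal sketch on a toy vector (not part of the Boston data) verifying the equivalence:

```r
# scale() standardizes each column as (x - mean(x)) / sd(x).
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
auto   <- as.numeric(scale(x))
all.equal(manual, auto)   # TRUE
```

This is why every scaled variable has mean exactly 0 in the summary above.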

# create data frame object
boston_scaled <- as.data.frame(boston_scaled)

# create a categorical variable 'crime' with cut points as quantiles of the variable 'crim'
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

Let’s create the training data by randomly selecting 80% of the rows of the data. The remaining rows will be treated as test data.

# choose randomly 80% of the rows for the training set
# (call set.seed() first if a reproducible split is needed)
ind <- sample(nrow(boston_scaled),  size = nrow(boston_scaled) * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

Linear discriminant analysis (LDA)

Linear discriminant analysis is performed on the training data with crime as the target variable and all other variables as predictors. After the LDA fit is obtained, the biplot is plotted.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2648515 0.2500000 0.2376238 0.2475248 
## 
## Group means:
##                  zn      indus          chas        nox          rm        age
## low       1.0411093 -0.9021842 -0.1251477505 -0.8847044  0.40705116 -0.8865465
## med_low  -0.0833036 -0.2819395  0.0005392655 -0.5730096 -0.10985904 -0.3410669
## med_high -0.3871935  0.2004107  0.1787970010  0.4091306  0.04385536  0.4041791
## high     -0.4872402  1.0171519 -0.1148450583  1.0581412 -0.39225400  0.8234116
##                 dis        rad        tax     ptratio       black      lstat
## low       0.9165942 -0.6974378 -0.7426399 -0.41762332  0.38119357 -0.7516720
## med_low   0.3466853 -0.5406779 -0.4979925 -0.02930992  0.32091020 -0.1609564
## med_high -0.3803789 -0.4925764 -0.3679189 -0.34561698  0.07515175  0.0204888
## high     -0.8428566  1.6377820  1.5138081  0.78037363 -0.76621760  0.9513763
##                medv
## low       0.4874449
## med_low   0.0085978
## med_high  0.1667764
## high     -0.7788111
## 
## Coefficients of linear discriminants:
##                 LD1          LD2          LD3
## zn       0.09932684  0.660082422 -0.924446780
## indus    0.20684196 -0.173509150  0.355125653
## chas    -0.03474956 -0.074291398  0.125240743
## nox      0.26876697 -0.748634714 -1.377750333
## rm       0.07723357  0.012536636 -0.123701216
## age      0.08442120 -0.325512288 -0.120973431
## dis     -0.08550872 -0.218908959  0.157590088
## rad      4.49240099  0.992157796  0.006802416
## tax      0.15738914 -0.102939131  0.427837042
## ptratio  0.10718133 -0.003208959 -0.226272327
## black   -0.02962752  0.097919438  0.165267471
## lstat    0.21600800 -0.222397929  0.317757617
## medv     0.03067033 -0.494810214 -0.193725250
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9702 0.0230 0.0068
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

‘rad’ has the longest arrow, so it is the most discriminating predictor: it differentiates well between the crime categories.
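The arrow length of each predictor in the biplot is simply the Euclidean norm of its coefficients on the two plotted discriminants. A sketch with a toy two-row coefficient matrix (the real values would come from `coef(lda.fit)`, which changes with the random split):

```r
# arrow length = sqrt(LD1^2 + LD2^2) for each predictor
coefs <- matrix(c(4.49, 0.99,    # rad  (long arrow)
                  0.10, 0.66),   # zn   (short arrow)
                nrow = 2, byrow = TRUE,
                dimnames = list(c("rad", "zn"), c("LD1", "LD2")))
arrow_len <- sqrt(rowSums(coefs^2))
arrow_len["rad"] > arrow_len["zn"]   # TRUE: 'rad' dominates the plot
```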

LDA prediction

We would like to predict the ‘crime’ categories for the test data, so we keep a copy of the true categories in ‘correct_classes’ and then remove the ‘crime’ column from the test data.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13       7        0    0
##   med_low    5      16        4    0
##   med_high   0       9       16    5
##   high       0       0        0   27

The cross-tabulation shows that the category ‘high’ is predicted very well in the test data, while ‘med_high’ is predicted worst: of the 30 ‘med_high’ observations, only 16 are correctly classified. The ‘low’ and ‘med_low’ categories are predicted correctly about two thirds of the time.
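The overall accuracy can be read off the cross-tabulation as the sum of the diagonal divided by the total. A sketch using the counts printed above (they will differ for a different random split):

```r
# confusion matrix from the cross-tabulation above
tab <- matrix(c(13,  7,  0,  0,
                 5, 16,  4,  0,
                 0,  9, 16,  5,
                 0,  0,  0, 27),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 3)   # 0.706
```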

K-means clustering

Let’s consider the Boston data again and compute distance matrices, such as the Euclidean and Manhattan distances, on the scaled data.

data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)


# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of euclidean distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method = 'manhattan')

# look at the summary of manhattan distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Even on the scaled data, the two distance measures have very different ranges. Let’s perform k-means clustering with 3 clusters.
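For the same pair of points, the Manhattan distance is always at least as large as the Euclidean distance, which is one reason its summary range above is wider. A two-point sketch:

```r
# Euclidean vs. Manhattan distance for the points (0,0) and (3,4)
pts <- rbind(c(0, 0), c(3, 4))
d_eu  <- dist(pts)                        # sqrt(3^2 + 4^2) = 5
d_man <- dist(pts, method = "manhattan")  # |3| + |4| = 7
c(euclidean = as.numeric(d_eu), manhattan = as.numeric(d_man))
```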

# k-means clustering
km3 <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km3$cluster)

Overall, the plots show groups in three colors, but not all three colors appear in every panel, and in some panels the points do not form clear clusters.

To find an appropriate number of clusters, let’s try every value of k from 1 to 10.

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The plot shows a large improvement from one cluster to two, but little further improvement after that. If a very small total within-cluster sum of squares is the goal, more clusters could be considered; however, many clusters are not useful if they do not actually differ across the variables. Let’s go ahead with k = 2, which gives a clear improvement over k = 1.
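One way to formalize this "elbow" reading is to look at the relative drop in the total within-cluster sum of squares between consecutive values of k. A sketch on a made-up twcss curve (with the real analysis, the values would come from the `sapply()` call above):

```r
# relative improvement between consecutive k; the biggest drop marks the elbow
twcss_demo <- c(7000, 3500, 3100, 2900, 2800)
rel_drop <- -diff(twcss_demo) / twcss_demo[-length(twcss_demo)]
round(rel_drop, 2)        # the drop from k = 1 to k = 2 dominates
which.max(rel_drop) + 1   # suggests k = 2
```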

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

Almost all the pairwise panels show a good separation of the two colored groups, so k = 2 was a good choice.

LDA on k-means clusters

Let’s use k = 4 to generate clusters and then perform LDA to see how well these 4 clusters can be separated by linear discriminant analysis.

# k-means clustering
km4 <- kmeans(boston_scaled, centers = 4)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km4$cluster)

boston_scaled1 <- data.frame(boston_scaled)
boston_scaled1 <- boston_scaled1 %>%
  mutate(cluster = km4$cluster)

# linear discriminant analysis
lda.fit4 <- lda(cluster ~ ., data = boston_scaled1)

# print the lda.fit object
lda.fit4
## Call:
## lda(cluster ~ ., data = boston_scaled1)
## 
## Prior probabilities of groups:
##         1         2         3         4 
## 0.2727273 0.4051383 0.1185771 0.2035573 
## 
## Group means:
##         crim         zn      indus        chas        nox         rm        age
## 1  0.9756821 -0.4872402  1.0939296 -0.21526964  0.9652909 -0.4642740  0.7722565
## 2 -0.3888773 -0.3531707 -0.3412963 -0.27232907 -0.4124824 -0.1646322 -0.1241405
## 3 -0.2131078 -0.3300240  0.4860617  1.49936604  0.9890166  0.1347329  0.7475175
## 4 -0.4091050  1.5479668 -1.0695170 -0.04298342 -1.0484685  0.8712179 -1.2230451
##          dis        rad        tax    ptratio       black       lstat
## 1 -0.8278800  1.4598695  1.4701648  0.8275348 -0.71273723  0.93140384
## 2  0.1406541 -0.5964345 -0.6080622  0.1676730  0.31637467 -0.15551283
## 3 -0.7281887 -0.2889628 -0.1777283 -1.2881927 -0.04951573  0.01817238
## 4  1.2534434 -0.6005355 -0.6559834 -0.6920507  0.35409586 -0.94897033
##          medv
## 1 -0.78074928
## 2 -0.08063264
## 3  0.47647538
## 4  0.92897641
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## crim     0.012656810  0.02428702 -0.14961045
## zn       0.095335547  0.40053401 -1.13758796
## indus   -0.504615517 -0.19850861 -0.24468003
## chas     0.157622073 -0.86698142 -0.33492529
## nox      0.225995414 -0.89706014 -0.52034056
## rm       0.021371579  0.30917185 -0.33017316
## age     -0.167677382 -0.43903833  0.37200499
## dis      0.362302731 -0.13797266 -0.43019018
## rad     -1.477515958  0.41995946 -0.16247733
## tax     -0.832279137  0.18854689 -0.61346017
## ptratio  0.007751927  0.89661450  0.30890506
## black    0.005638473  0.08388227  0.07581497
## lstat   -0.199096514  0.28003158 -0.34738897
## medv     0.217842616 -0.11520682 -0.55811203
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.6775 0.1791 0.1435
# target classes as numeric
classes <- as.numeric(boston_scaled1$cluster)

# plot the lda results
plot(lda.fit4, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit4, myscale = 2)

Variables with longer arrows in the plot are the best linear discriminants for these 4 groups.

Visualization using plotly

# 3D plotting requires the plotly package
library(plotly)

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)

Dimensionality reduction techniques

First thing is to load required libraries or R packages for this exercise.

# Required libraries
library(GGally)
library(corrplot)
library(FactoMineR)
library(dplyr)   # select(), %>%
library(tidyr)   # gather()

Read the data

The ‘human’ data created in the data wrangling exercise is read here. Let’s explore the data in more detail.

# read "human.csv" dataset saved last week
human <- read.table("D:/Helsinki/Courses/OpenDataScience/IODS-project/data/human.csv", sep=',', header=TRUE)


# summary of variables
summary(human)
##     GNI_Cap         LifeExpect       ExpEduYr          MMR        
##  Min.   :   581   Min.   :49.00   Min.   : 5.40   Min.   :   1.0  
##  1st Qu.:  4198   1st Qu.:66.30   1st Qu.:11.25   1st Qu.:  11.5  
##  Median : 12040   Median :74.20   Median :13.50   Median :  49.0  
##  Mean   : 17628   Mean   :71.65   Mean   :13.18   Mean   : 149.1  
##  3rd Qu.: 24512   3rd Qu.:77.25   3rd Qu.:15.20   3rd Qu.: 190.0  
##  Max.   :123124   Max.   :83.50   Max.   :20.20   Max.   :1100.0  
##       ABR         PercentParlRepre   edu2Ratio         LabRatio     
##  Min.   :  0.60   Min.   : 0.00    Min.   :0.1717   Min.   :0.1857  
##  1st Qu.: 12.65   1st Qu.:12.40    1st Qu.:0.7264   1st Qu.:0.5984  
##  Median : 33.60   Median :19.30    Median :0.9375   Median :0.7535  
##  Mean   : 47.16   Mean   :20.91    Mean   :0.8529   Mean   :0.7074  
##  3rd Qu.: 71.95   3rd Qu.:27.95    3rd Qu.:0.9968   3rd Qu.:0.8535  
##  Max.   :204.80   Max.   :57.50    Max.   :1.4967   Max.   :1.0380
# visualize variables from the dataset 'human' 
ggpairs(human)

# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot

The tables and graphs above show a lot of variation among countries in the ‘human’ variables. ‘Gross National Income per capita’ and ‘Maternal mortality ratio’ show the most variation; for both, the mean is far from the median, suggesting skewed distributions, and the two variables are negatively correlated. ‘Life expectancy at birth’ has a longer left tail, indicating a few nations with low life expectancy. The female-to-male ratios in secondary education and in the labor force both have means above 0.7 but longer left tails, indicating many countries with gender inequality. The longer right tails of ‘Maternal mortality ratio’ and ‘Adolescent birth rate’ show that these values are generally small but that the problems persist in some countries.

Regarding the relationships between these variables: the correlation plot shows that ‘Life expectancy at birth’ and ‘Expected years of schooling’ are positively correlated (correlation coefficient 0.789), as are ‘Maternal mortality ratio’ and ‘Adolescent birth rate’ (0.759). ‘Life expectancy at birth’ is strongly negatively correlated with ‘Maternal mortality ratio’ (-0.857) and negatively correlated with ‘Adolescent birth rate’ (-0.729). Similarly, ‘Expected years of schooling’ is negatively correlated with both ‘Maternal mortality ratio’ and ‘Adolescent birth rate’. The female-to-male ratio of secondary education is also negatively correlated with ‘Maternal mortality ratio’, so increasing female education could help reduce maternal mortality.
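Rather than reading the strongest correlations off the plot, the most extreme pair can be extracted programmatically. A sketch with a small toy correlation matrix built from three of the coefficients quoted above (with the real data, `cm` would simply be `cor(human)`):

```r
# find the most strongly correlated pair of distinct variables
cm <- matrix(c( 1.000,  0.789, -0.857,
                0.789,  1.000, -0.729,
               -0.857, -0.729,  1.000),
             nrow = 3,
             dimnames = list(c("LifeExpect", "ExpEduYr", "MMR"),
                             c("LifeExpect", "ExpEduYr", "MMR")))
cm[upper.tri(cm)] <- NA   # keep each pair only once
diag(cm) <- NA            # drop the trivial 1s on the diagonal
idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)
c(rownames(cm)[idx[1]], colnames(cm)[idx[2]])   # "MMR" "LifeExpect"
```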

Principal component analysis (PCA)

Let’s perform principal component analysis (PCA) on the non-standardized human data.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# Variation shown by principal component analysis
s <- summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

Now let’s perform principal component analysis (PCA) on the standardized human data.

# standardize the variables
human_std <- scale(human)

# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)

# Variation shown by principal component analysis
s <- summary(pca_human_std)
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1) 


# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human_std, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = 'Principal Component Analysis', xlim = c(-0.2, 0.55))
legend(10,18, c("MMR: Maternal mortality ratio", "ABR: Adolescent birth rate", "GNI_Cap: Gross National Income per capita", "LifeExpect: Life expectancy at birth", "ExpEduYr: Expected years of schooling","edu2Ratio: Female to male ratio of secondary education", "LabRatio: Female to male ratio of Labor force proportion", "PercentParlRepre: % female representatives in parliament"), col =rep('pink',8))   #, trace = TRUE)

The results of the two analyses (with and without standardization) are different. Principal components are computed from the variance, so when non-standardized data is used, the variable with the largest values (‘Gross National Income per capita’, GNI_Cap) dominates: its arrow is long and parallel to the x-axis, and the other arrows can hardly be seen. With standardized data, all variables are on the same scale, so the contribution of each variable to the principal components can be studied properly.

Looking at the plot above, it is clear that two variables, ‘Maternal mortality ratio’ (MMR) and ‘Adolescent birth rate’ (ABR), are positively correlated with the first principal component, while four variables, ‘Gross National Income per capita’ (GNI_Cap), ‘Life expectancy at birth’ (LifeExpect), ‘Expected years of schooling’ (ExpEduYr) and the female-to-male ratio of secondary education (edu2Ratio), are negatively correlated with it. The first principal component explains 53.6% of the variability in the data and the second explains 16.2%. The second principal component correlates negatively with the female-to-male ratio of labor-force participation (LabRatio) and the percentage of female representatives in parliament (PercentParlRepre).
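The scale-dominance effect can be demonstrated on toy data: when one column has a far larger scale, the unscaled PCA puts essentially all of the variance on PC1, while the scaled version spreads it out. This is a minimal sketch with made-up data, not the human dataset:

```r
# one column on a huge scale dominates unscaled PCA
set.seed(1)
toy <- data.frame(big = rnorm(100, sd = 1000), small = rnorm(100, sd = 1))
p_raw <- prcomp(toy)
p_std <- prcomp(toy, scale. = TRUE)
# proportion of variance on PC1:
summary(p_raw)$importance[2, 1]   # close to 1: 'big' dominates
summary(p_std)$importance[2, 1]   # roughly 0.5 after scaling
```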

Multiple Correspondence Analysis

Let’s read the ‘tea’ data from the FactoMineR package and explore it in more detail.

data("tea")


# Structure and dimension of the data 
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

The data has 36 variables (columns) describing 300 individuals (rows). Let’s select 6 columns for further analysis using multiple correspondence analysis.

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "age_Q")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                                                                     
##                   where       age_Q   
##  chain store         :192   15-24:92  
##  chain store+tea shop: 78   25-34:69  
##  tea shop            : 30   35-44:40  
##                             45-59:61  
##                             +60  :38
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ age_Q: Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

The table above summarizes the variables selected for the analysis, and the plot shows how the data are distributed among the categories of each variable. Next, let’s perform MCA on these columns.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.313   0.266   0.249   0.205   0.184   0.175   0.168
## % of var.             13.434  11.404  10.689   8.791   7.883   7.511   7.218
## Cumulative % of var.  13.434  24.838  35.527  44.318  52.201  59.712  66.930
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.147   0.141   0.135   0.119   0.101   0.072   0.056
## % of var.              6.302   6.051   5.787   5.086   4.348   3.094   2.403
## Cumulative % of var.  73.231  79.282  85.069  90.155  94.503  97.597 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.068  0.005  0.002 |  0.048  0.003  0.001 | -0.471
## 2                  |  0.142  0.021  0.009 |  0.089  0.010  0.004 | -1.069
## 3                  | -0.199  0.042  0.033 | -0.254  0.081  0.053 | -0.600
## 4                  | -0.798  0.678  0.665 | -0.240  0.072  0.060 |  0.064
## 5                  | -0.199  0.042  0.033 | -0.254  0.081  0.053 | -0.600
## 6                  | -0.582  0.360  0.361 | -0.167  0.035  0.030 | -0.235
## 7                  | -0.190  0.039  0.022 | -0.036  0.002  0.001 | -0.411
## 8                  |  0.150  0.024  0.009 |  0.307  0.118  0.036 | -0.880
## 9                  |  0.268  0.076  0.026 |  0.900  1.016  0.290 |  0.175
## 10                 |  0.605  0.390  0.137 |  0.870  0.948  0.282 | -0.073
##                       ctr   cos2  
## 1                   0.296  0.106 |
## 2                   1.527  0.527 |
## 3                   0.481  0.297 |
## 4                   0.005  0.004 |
## 5                   0.481  0.297 |
## 6                   0.074  0.059 |
## 7                   0.226  0.103 |
## 8                   1.035  0.298 |
## 9                   0.041  0.011 |
## 10                  0.007  0.002 |
## 
## Categories (the 10 first)
##                       Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test
## black              |  0.750  7.377  0.184  7.421 |  0.467  3.365  0.071  4.618
## Earl Grey          | -0.389  5.179  0.273 -9.036 | -0.016  0.010  0.000 -0.367
## green              |  0.594  2.063  0.044  3.610 | -0.954  6.272  0.113 -5.800
## alone              | -0.096  0.316  0.017 -2.253 | -0.262  2.803  0.128 -6.183
## lemon              |  0.483  1.366  0.029  2.938 |  0.202  0.282  0.005  1.229
## milk               | -0.091  0.093  0.002 -0.813 |  0.315  1.307  0.026  2.811
## other              |  0.938  1.404  0.027  2.853 |  2.736 14.071  0.232  8.321
## tea bag            | -0.460  6.376  0.277 -9.096 | -0.154  0.842  0.031 -3.046
## tea bag+unpackaged |  0.174  0.504  0.014  2.031 |  0.719 10.150  0.236  8.400
## unpackaged         |  1.718 18.837  0.403 10.972 | -1.150  9.948  0.180 -7.346
##                       Dim.3    ctr   cos2 v.test  
## black              | -0.740  9.026  0.179 -7.322 |
## Earl Grey          |  0.334  4.788  0.201  7.751 |
## green              | -0.293  0.629  0.011 -1.778 |
## alone              | -0.047  0.098  0.004 -1.118 |
## lemon              |  1.264 11.746  0.197  7.685 |
## milk               | -0.379  2.011  0.038 -3.375 |
## other              | -0.957  1.836  0.028 -2.910 |
## tea bag            | -0.436  7.208  0.249 -8.627 |
## tea bag+unpackaged |  0.663  9.193  0.200  7.740 |
## unpackaged         |  0.330  0.873  0.015  2.107 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.275 0.154 0.216 |
## How                | 0.060 0.295 0.235 |
## how                | 0.484 0.334 0.259 |
## sugar              | 0.132 0.013 0.200 |
## where              | 0.528 0.632 0.233 |
## age_Q              | 0.402 0.169 0.354 |
# visualize MCA

plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic", title='Multiple Correspondence Analysis', palette = c('black','green','red','blue','violet','brown'))
legend(2,3, c("Tea", "How", "how", "sugar", "where", "age_Q"), col =c('black','green','red','blue','violet','brown'), pch = 17)

Looking at the graph, dimension 1 primarily captures variation in ‘where’, ‘how’, ‘sugar’ and ‘age_Q’, while the vertical dimension (dimension 2) primarily captures variation in ‘Tea’ and ‘How’. The tea packaging (‘how’) and where it is bought (‘where’) go together, and a correspondence between age (‘age_Q’), type of tea (‘Tea’), ‘sugar’ and the additive taken with tea (‘How’) is clearly visible. These two dimensions explain only about 13% and 11% of the inertia, respectively, so more dimensions should be studied and the model fit investigated further. However, the analysis is concluded here with these 6 columns.
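How many dimensions a fuller analysis would need can be read from the eigenvalue table (FactoMineR stores it in `mca$eig`). A sketch using the "% of var." row printed by `summary(mca)` above:

```r
# percentages of inertia per MCA dimension, from the summary above
pct <- c(13.434, 11.404, 10.689, 8.791, 7.883, 7.511, 7.218)
cum_pct <- cumsum(pct)
which(cum_pct >= 50)[1]   # 5 dimensions are needed to pass 50% of the inertia
```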


Analysis of longitudinal data

First thing is to load required libraries or R packages for this exercise.

# Required libraries
library(lme4)

‘RATS’ data

The ‘RATS’ data in wide and long form, created in the data wrangling exercise, are read here.

# read “RATS.csv” and “RATSL.csv” dataset 
RATS <- read.table("D:/Helsinki/Courses/OpenDataScience/IODS-project/data/RATS.csv", sep=',', header=TRUE)
RATSL <- read.table("D:/Helsinki/Courses/OpenDataScience/IODS-project/data/RATSL.csv", sep=',', header=TRUE)
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)

Individual profiles by Group

Let’s explore the data in more detail. The following figure shows each rat’s growth profile as a line graph, grouped by treatment group.

# Draw the plot
ggplot(RATSL, aes(x = Time, y = Weight, col=ID)) +
  geom_line() +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "right") +
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

The first group contains rats with low starting weights, and their weight gain is also small. Groups 2 and 3 contain heavier rats, and their weight gain over the 9-week period is considerably higher. Rat ID 12 has the highest starting weight (555 grams) and remains heavier than all the other rats throughout the period, while rat ID 2 has the lowest weight for all 9 weeks.

Plot with standardized data

Let’s draw a similar plot with standardized data. The standardized values are obtained by subtracting, from each observation, the mean of all observations at the same time point, and then dividing by that time point’s standard deviation.

# Standardise the variable Weight
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdwt = (Weight - mean(Weight))/sd(Weight) ) %>%
  ungroup()


# Plot again with the standardised weight
ggplot(RATSL, aes(x = Time, y = stdwt, col = ID)) +
  geom_line() +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "right") +
  scale_y_continuous(name = "standardized weight")

In the standardized plot, group 1 shows no clear increasing or decreasing trend, and rat ID 2 still remains lower than all the others. Group 2 showed a much steeper slope before standardization, but after it one rat’s observations show a decreasing trend. Similarly, in group 3 all profiles are non-increasing.
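As a quick sanity check of the standardization step, each measurement day’s standardized values should have mean 0 and standard deviation 1. The base-R sketch below mirrors the group_by() + mutate() pipeline above on toy numbers (illustrative only, not the RATS data):

```r
# Toy data: two measurement days, three weights each (illustrative only).
toy <- data.frame(Time   = rep(c(1, 8), each = 3),
                  Weight = c(240, 250, 260, 300, 310, 320))

# Same transformation as the group_by() + mutate() pipeline, in base R:
# subtract each day's mean and divide by that day's standard deviation.
toy$stdwt <- ave(toy$Weight, toy$Time,
                 FUN = function(w) (w - mean(w)) / sd(w))

# Every day should now be centred at 0 with unit spread.
tapply(toy$stdwt, toy$Time, mean)  # ~0 for each Time
tapply(toy$stdwt, toy$Time, sd)    # 1 for each Time
```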

Summary measure plot

# Summary data with mean and standard error of weight by Group and Time 
wtSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n()) ) %>%
  ungroup()

# Glimpse the data
glimpse(wtSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 3...
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375...
## $ se    <dbl> 5.381640, 4.629100, 4.057346, 4.808614, 3.909409, 4.166190, 3...
# Plot the mean profiles
ggplot(wtSS, aes(x = Time, y = mean, col = Group)) +
  geom_line() +
  geom_point(size=3) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width=0.3) +
  theme(legend.position = c(0.8, 0.5)) +
  scale_y_continuous(name = "mean(weight) +/- se(weight)")

The plot above shows the progression of each group over the nine-week period. The filled circles show the mean values and the vertical bars show mean(weight) +/- se(weight), i.e. the variation in the data. Group 1 has small standard errors, meaning little variation between observations. Group 3 shows more variation than group 1 and has the highest mean weight at every time point. Group 2 contains the rat with the highest weight (rat ID 12), yet its mean weight is lower than that of group 3. Because that one observation differs so much from the rest, group 2 shows the largest variation in the plot.
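The error bars come straight from the summary formula se = sd/sqrt(n) used in the summarise() call above; a small worked check with made-up weights (not the RATS values):

```r
# Four hypothetical weight observations for one group at one time point.
w <- c(240, 250, 245, 255)

# Standard error of the mean: sample sd divided by sqrt(sample size),
# exactly as computed in the summarise() call above.
se <- sd(w) / sqrt(length(w))
se  # about 3.23
```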

Boxplot

Let’s explore a box plot of the data: compute the mean weight for each rat, excluding the baseline (day 1) observation, and draw a boxplot for each group.

# Create a summary data by Group and ID with mean as the summary variable (ignoring baseline Time(days) 1).
RATSLSS <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()

# Glimpse the data
glimpse(RATSLSS)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9...
# Draw a boxplot of the mean versus Group
ggplot(RATSLSS, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=3, fill = "black") +
  theme(legend.position = "right") +
  scale_y_continuous(name = "mean(Weight), Time(Day)")

The box plot shows clearly that the three groups have quite different weight levels. Group 1 has the lowest values, all below 300 grams, and low variation, with one observation (rat ID 2) as an outlier. Group 2 shows very high variation, with one observation (rat ID 12) as an outlier; because of rat ID 12’s high weight, the group mean is pulled upwards and the median lies near the 1st quartile. Group 3 has moderate variation among its observations, with one outlier.

Let’s remove the two extreme points of the dataset, rat ID 2 and rat ID 12, and redraw the boxplot.

# Create a new data by filtering the outlier and adjust the ggplot code the draw the plot again with the new data
RATSLSS1 <- RATSLSS %>%
  filter(mean < 550 & mean > 250)
  
# Draw a boxplot of the mean versus Group
ggplot(RATSLSS1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=3, fill = "black") +
  scale_y_continuous(name = "mean(Weight), Time(Day)")

After removing the outliers, group 2 shows little variation among its observations, and its mean now lies almost in the middle of the box. Group 3 is unchanged, so its box remains the same. Its outlier could also be removed, but since that value is not as extreme, it is kept in this analysis.

Anova

We have three groups, so instead of a t-test let’s perform an analysis of variance (ANOVA). The groups are clearly different from the very first observation, so simply comparing mean values across groups would show a clear difference. To check whether the mean values differ because of the baseline values or because of the grouping itself, let’s fit a linear regression with the mean as the response and the baseline and group as covariates, and then compute the ANOVA table.

# Add the baseline from the original data as a new variable to the summary data
RATSLSS2 <- RATSLSS %>%
  mutate(baseline = RATS$WD1)

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATSLSS2)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value   Pr(>F)    
## baseline   1 253625  253625 1859.8201 1.57e-14 ***
## Group      2    879     439    3.2219  0.07586 .  
## Residuals 12   1636     136                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

‘baseline’ is by far the most important covariate, so the difference in mean values is driven mainly by the baseline weights. ‘Group’ could also be considered important if the significance level is set at 0.1.
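One caveat worth noting when reading this table: anova() on an lm fit produces a sequential (Type I) table, so each term is assessed after the terms listed before it, and the order of covariates matters. The toy simulation below (made-up numbers, not the rats data) illustrates this by swapping the order of baseline and group:

```r
# Simulated data: 12 animals in 3 groups, response driven mostly by baseline.
set.seed(1)
toy <- data.frame(baseline = rnorm(12, mean = 250, sd = 20),
                  Group    = factor(rep(1:3, each = 4)))
toy$response <- 0.9 * toy$baseline + rnorm(12, sd = 5)

# Sequential (Type I) tables: each term is tested after those before it.
a1 <- anova(lm(response ~ baseline + Group, data = toy))
a2 <- anova(lm(response ~ Group + baseline, data = toy))

a1["Group", "Sum Sq"]  # Group adjusted for baseline
a2["Group", "Sum Sq"]  # Group entered first -- a different sum of squares
```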

BPRS data

Let’s perform a similar analysis on the BPRS data.

# read “BPRS.csv” and “BPRSL.csv” dataset 
BPRS <- read.table("D:/Helsinki/Courses/OpenDataScience/IODS-project/data/BPRS.csv", sep=',', header=TRUE)
BPRSL <- read.table("D:/Helsinki/Courses/OpenDataScience/IODS-project/data/BPRSL.csv", sep=',', header=TRUE)
BPRSL$treatment <- factor(BPRSL$treatment )
BPRSL$subject <- factor(paste0(BPRSL$subject, "_",BPRSL$treatment) )

A profile plot for each patient (subject), with the line type showing the treatment group.

# Plot the BPRSL data
ggplot(BPRSL, aes(x = week, y = bprs, col=subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "week", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "right")

From the plot it can be seen that the subjects of the two treatment groups have overlapping line graphs, so the treatment groups do not seem very different. Subject 11 on treatment 2 clearly has higher measurements from the start to the end of the follow-up, and an overall decreasing trend can be seen for most subjects. Following each of the 40 individual lines is not feasible, and a scatter-plot matrix drawn with pairs() is possible but hard to interpret with this many small panels.
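If a pairs() plot over the weekly measurements is wanted, the long data first has to be reshaped to wide form, one column per week. A base-R sketch on toy data (tidyr::pivot_wider would do the same job):

```r
# Toy long-form data: three subjects, three weeks (illustrative only).
toy <- data.frame(subject = rep(1:3, each = 3),
                  week    = rep(0:2, times = 3),
                  bprs    = c(45, 40, 38, 50, 48, 47, 42, 41, 36))

# Long -> wide: one row per subject, one bprs column per week.
toyw <- reshape(toy, idvar = "subject", timevar = "week",
                direction = "wide")
toyw  # columns: subject, bprs.0, bprs.1, bprs.2

# Scatter-plot matrix of the weekly measurements.
pairs(toyw[, -1])
```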

Hence, let’s perform a regression analysis to check the effect of the treatment group and the treatment duration on the bprs measurements. First, fit a regression that does not consider the repeated measures, i.e. the correlation among observations from the same patient is ignored here.

# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)

# print out a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

As expected, treatment does not show a significant effect on the bprs measurements. Since treatment is a binary variable, treatment 1 is taken as the baseline and treatment 2 appears as a model term. The duration of treatment (week) and the intercept both have small p-values, suggesting they are important. The estimate for ‘week’ is negative, so under both treatments bprs decreases as the number of weeks increases.

So far we have considered only an independent-observations model with treatment and treatment duration as covariates. To account for the correlation between observations from the same subject, and to model the subject-specific randomness, let’s fit mixed effects models.

  1. Random intercept model

Here we again fit a regression model, but now every subject shares the same slope while each subject gets its own intercept.
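To make this structure concrete, here is a toy simulation (invented numbers, not the BPRS data) of what the (1 | subject) term assumes: one common slope for week, plus a subject-specific intercept drawn from a normal distribution.

```r
# Five toy subjects measured over weeks 0..8.
set.seed(42)
n_subj <- 5
weeks  <- 0:8

# Subject-specific intercepts: a common level (46) plus a random shift.
subj_int <- rnorm(n_subj, mean = 46, sd = 10)

toy <- expand.grid(subject = 1:n_subj, week = weeks)

# Shared slope (-2.3 per week) + each subject's own intercept + noise.
toy$bprs <- subj_int[toy$subject] - 2.3 * toy$week + rnorm(nrow(toy), sd = 7)

head(toy)
```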

# access library lme4
library(lme4)

# Create a random intercept model
BPRSL_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

# Print the summary of the model
summary(BPRSL_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2582.9   2602.3  -1286.5   2572.9      355 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.27506 -0.59909 -0.06104  0.44226  3.15835 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 97.39    9.869   
##  Residual             54.23    7.364   
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.3521  19.750
## week         -2.2704     0.1503 -15.104
## treatment2    0.5722     3.2159   0.178
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.256       
## treatment2 -0.684  0.000

The estimates for week and treatment have not changed drastically, but the standard error of week has decreased. The randomness due to differences between subjects was not modelled earlier, which inflated the standard error of week. We now have a basic mixed effects model.

Please note that treatment2 still does not contribute meaningfully to the model, so it could be removed. However, to reproduce the analysis of chapter 9, I will keep it in the models that follow.

Next, let’s add a random slope to the random intercept model. In this model both the intercept and the slope (the coefficient of week) can differ between subjects.

# create a random intercept and random slope model
BPRSL_ref1 <- lmer(bprs ~ week + treatment +(week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRSL_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.2   2550.4  -1254.6   2509.2      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4655 -0.5150 -0.0920  0.4347  3.7353 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 167.827  12.955        
##           week          2.331   1.527   -0.67
##  Residual              36.747   6.062        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  45.9830     2.6470  17.372
## week         -2.2704     0.2713  -8.370
## treatment2    1.5139     3.1392   0.482
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.545       
## treatment2 -0.593  0.000
# perform an ANOVA test on the two models
anova(BPRSL_ref1, BPRSL_ref)
## Data: BPRSL
## Models:
## BPRSL_ref: bprs ~ week + treatment + (1 | subject)
## BPRSL_ref1: bprs ~ week + treatment + (week | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## BPRSL_ref     5 2582.9 2602.3 -1286.5   2572.9                         
## BPRSL_ref1    7 2523.2 2550.4 -1254.6   2509.2 63.663  2  1.499e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the random effects, most of the variance is still explained by the intercept term, but the residual variance is substantially lower than in the earlier model. The ANOVA test also suggests that the ‘random slope and random intercept’ model is better than the ‘random intercept’ model alone, and its AIC is lower as well.
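The anova() table can be verified by hand from the two model summaries: the likelihood-ratio (Chisq) statistic is the difference of the deviances, and AIC = deviance + 2 × (number of parameters). Using the numbers printed above:

```r
# Deviances and parameter counts read from the two model summaries above.
dev_ref  <- 2572.9  # random intercept model, 5 parameters
dev_ref1 <- 2509.2  # random intercept + random slope model, 7 parameters

# Likelihood-ratio (Chisq) statistic and AIC values.
chisq    <- dev_ref - dev_ref1
aic_ref  <- dev_ref  + 2 * 5
aic_ref1 <- dev_ref1 + 2 * 7

chisq                 # about 63.7, matching the Chisq column
c(aic_ref, aic_ref1)  # 2582.9 and 2523.2, matching the AIC column
```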

Next, let’s add an interaction term between week and treatment to the fixed effects.

# create a random intercept and random slope model
BPRSL_ref2 <- lmer(bprs ~ week*treatment  +(week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRSL_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2523.5   2554.5  -1253.7   2507.5      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4747 -0.5256 -0.0866  0.4435  3.7884 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 164.204  12.814        
##           week          2.203   1.484   -0.66
##  Residual              36.748   6.062        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.9840  16.047
## week             -2.6283     0.3752  -7.006
## treatment2       -2.2911     4.2200  -0.543
## week:treatment2   0.7158     0.5306   1.349
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.668              
## treatment2  -0.707  0.473       
## wek:trtmnt2  0.473 -0.707 -0.668
# perform an ANOVA test on the two models
anova(BPRSL_ref2, BPRSL_ref1)
## Data: BPRSL
## Models:
## BPRSL_ref1: bprs ~ week + treatment + (week | subject)
## BPRSL_ref2: bprs ~ week * treatment + (week | subject)
##            npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## BPRSL_ref1    7 2523.2 2550.4 -1254.6   2509.2                    
## BPRSL_ref2    8 2523.5 2554.6 -1253.7   2507.5  1.78  1     0.1821

The new model with the interaction term shows no improvement over the previous model. Next, compute the fitted values from this model and plot the observed and fitted values.

# draw the plot of BPRSL
ggplot(BPRSL, aes(x = week, y = bprs, col=subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "week", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "Observed bprs") +
  theme(legend.position = "right")

# Create a vector of the fitted values
Fitted <- fitted(BPRSL_ref2)

# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
  mutate(Fitted)

# draw the plot of BPRSL
ggplot(BPRSL, aes(x = week, y = Fitted, col = subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "week", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "Fitted bprs") +
  theme(legend.position = "right")

Since treatment has not shown importance in the model, let’s fit a model without the treatment term in the next section.

# create a random intercept and random slope model
BPRSL_ref3 <- lmer(bprs ~ week  +(week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPRSL_ref3)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2521.5   2544.8  -1254.7   2509.5      354 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4684 -0.5102 -0.0922  0.4365  3.7342 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 165.48   12.864        
##           week          2.33    1.526   -0.66
##  Residual              36.75    6.062        
## Number of obs: 360, groups:  subject, 40
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.7400     2.1175  22.073
## week         -2.2704     0.2712  -8.371
## 
## Correlation of Fixed Effects:
##      (Intr)
## week -0.669
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00293547 (tol = 0.002, component 1)
# perform an ANOVA test on the two models
anova(BPRSL_ref3, BPRSL_ref1)
## Data: BPRSL
## Models:
## BPRSL_ref3: bprs ~ week + (week | subject)
## BPRSL_ref1: bprs ~ week + treatment + (week | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## BPRSL_ref3    6 2521.5 2544.8 -1254.7   2509.5                     
## BPRSL_ref1    7 2523.2 2550.4 -1254.6   2509.2 0.2218  1     0.6376
# Create a vector of the fitted values
Fitted3 <- fitted(BPRSL_ref3)

# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>%
  mutate(Fitted3)

# draw the plot of BPRSL
ggplot(BPRSL, aes(x = week, y = Fitted3, col = subject)) +
  geom_line(aes(linetype = treatment)) +
  scale_x_continuous(name = "week", breaks = seq(0, 8, 2)) +
  scale_y_continuous(name = "Fitted bprs") +
  theme(legend.position = "right")

This model has a slightly smaller AIC than BPRSL_ref1, but overall it is not a clear improvement over BPRSL_ref1. Hence, the model with both a random slope and a random intercept is a good choice for this data.